
GroupedPredictor refactoring #618

Closed · wants to merge 5 commits

Conversation

FBruzzesi (Collaborator)

Description

This is a first attempt to refactor the GroupedPredictor class (personally one of my favorite features 😁), following the issue raised in #616.

While working on this I noticed another set of issues with the implementation:

  • Following [BUG] GroupedPredictor inconsistency for predict_proba having different classes per group #579, decision_function will yield wrong results. This is due to the following fact: if two groups pose different classification problems, namely binary and multiclass (as happens in the unit tests), the result of .decision_function(...) will be a 1d or a 2d array respectively, and concatenating the two leads to wrong behaviour.
    • This PR falls short of fixing that, yet I have a couple of ideas that need testing/triage.
  • Consider the following example when multiple groups are present: let groups=["a", "b"] with observed values (0, 0), (0, 1) and (1, 0). Currently, if at prediction time we encounter (a=0, b=2), for which we don't have a trained model, we fall back to the global model. However, I would argue that we should fall back to the model trained on a=0 (see the sketch after this list).
    • The PR implements this behaviour; however, I would like support in deciding how to introduce it in a non-breaking manner.
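
A minimal sketch of the scenario above, assuming the fallback behaviour proposed in this PR (the data and the unseen combination mirror the example):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklego.meta import GroupedPredictor

X = pd.DataFrame({"a": [0, 0, 1], "b": [0, 1, 0], "x": [1.0, 2.0, 3.0]})
y = [1.0, 2.0, 3.0]

model = GroupedPredictor(LinearRegression(), groups=["a", "b"]).fit(X, y)

# (a=0, b=2) was never seen during fit: today this falls back to the global
# model; the proposal is to fall back to the model trained on a=0 instead.
X_new = pd.DataFrame({"a": [0], "b": [2], "x": [1.5]})
model.predict(X_new)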

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines (flake8)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (also to the readme.md)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

TODOs

  • The current tests will break due to some API changes
  • Some steps could be quite unclear


# The grouping part we always want as a DataFrame with range index
return X_group.reset_index(drop=True)


def _get_estimator(estimators, grp_values, grp_names, return_level, fallback_method):
FBruzzesi (Collaborator, Author):

The point of this function is to determine which estimator to use to predict:

  • if fallback_method="raise", we must have the model for the group we are predicting
  • if fallback_method="next", we check recursively for the first available parent
  • if fallback_method="global", we summon the global model

Returning return_level is a trick to know how far back we went; it is used to slice an array afterwards (more comments where this happens).
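
A simplified sketch of this lookup (the global-model key is illustrative, not necessarily the exact one in the diff):

def _get_estimator(estimators, grp_values, grp_names, return_level, fallback_method):
    """Return (estimator, return_level) for the group identified by grp_values."""
    if grp_values in estimators:
        return estimators[grp_values], return_level
    if fallback_method == "raise":
        raise KeyError(f"No estimator found for {grp_names} = {grp_values}")
    if fallback_method == "next":
        # Recurse on the parent group by dropping the innermost level.
        return _get_estimator(estimators, grp_values[:-1], grp_names[:-1], return_level - 1, fallback_method)
    # fallback_method == "global": fall back to the dummy global-model group.
    return estimators[(1,)], 1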


if y is not None:
    y = check_array(y, ensure_2d=False)
# TODO: Validate class params?
FBruzzesi (Collaborator, Author):

Open question

if self.shrinkage is not None:
    self.__set_shrinkage_function()
if is_classifier(self.estimator):
    self.classes_ = np.sort(np.unique(y))  # TODO: Must be sequential for the rest of the code to work
FBruzzesi (Collaborator, Author):

If y has classes that are sorted but not sequential (e.g. [3, 7, 10] rather than [0, 1, 2]) we fall short; should we enforce that?
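
For reference, a sketch of how non-sequential labels could be mapped to column indices without that assumption (not part of the PR; searchsorted works because classes_ is sorted):

import numpy as np

classes_ = np.array([3, 7, 10])   # global classes: sorted but not sequential
est_classes = np.array([3, 10])   # classes seen by one group's estimator
col_ix = np.searchsorted(classes_, est_classes)  # -> array([0, 2])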

Comment on lines +134 to +135
# TODO: __grouped_predictor_target_value__?
frame = pd.DataFrame(X).assign(__target_value__=np.array(y)).reset_index(drop=True)
FBruzzesi (Collaborator, Author):

__target_value__ (and __global_model__ right after) could be "safe enough" column names, but we could even prefix them with __grouped_predictor or __sklego_grouped_predictor to be extra safe that nobody is already using them as column names 😂

@property
def n_levels_(self):
    check_is_fitted(self, ["fitted_levels_"])
    return len(self.fitted_levels_)

def fit(self, X, y=None):
FBruzzesi (Collaborator, Author):

The fit routine does the following (a rough sketch follows the list):

  • Creates a dataframe with X and y, with index being range(0, len(X))
    • If X was already a dataframe, it keeps the columns and we just overwrite the index
    • If X was an array, the column names coincide with the column indexes
  • Does some checking on the X, y values
  • Adds a dummy global-model column if required (with a fixed value of 1)
  • Based on the arguments (use_global_model, shrinkage and fallback_method), determines which levels/models need to be fitted by creating a list of lists, where each inner list contains the columns to group by
  • Trains a model for each of these levels/groups and their values
    • Ends up with a dict of key=group_value, value=fitted_estimator
  • Defines the shrinkage function and factors
    • If shrinkage is None, instead of doing nothing, I add a factor which is zero everywhere except for the trained model, where it is 1
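
A rough sketch of this flow, assuming the column names used in the PR (fit_levels, its arguments and the loop structure are illustrative, not the exact diff):

import numpy as np
import pandas as pd
from sklearn.base import clone

def fit_levels(X, y, estimator, fitted_levels):
    frame = pd.DataFrame(X).assign(__target_value__=np.array(y)).reset_index(drop=True)
    frame = frame.assign(__global_model__=1)  # dummy global-model column

    estimators = {}
    group_cols = fitted_levels[-1]  # the innermost level holds every grouping column
    for grp_names in fitted_levels:
        for grp_values, grp_frame in frame.groupby(grp_names):
            _X = grp_frame.drop(columns=["__target_value__", *group_cols])
            _y = grp_frame["__target_value__"]
            estimators[grp_values] = clone(estimator).fit(_X, _y)
    return estimators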


return self

def __set_fit_levels(self):
FBruzzesi (Collaborator, Author):

Send help here 😁

Comment on lines +264 to +265
for grp_names in self.fitted_levels_:
    for grp_values, grp_frame in frame.groupby(grp_names):
FBruzzesi (Collaborator, Author):

Example here could be:

groups = ["a", "b"]
fitted_levels_ = [["__global_model__"], ["__global_model__", "a"], ["__global_model__", "a", "b"]]

hence grp_names goes from the outermost to the innermost level, and grp_values are the unique values identifying the group.
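
A sketch of how such nested levels can be built from groups (an illustrative helper, not the exact __set_fit_levels implementation):

def expand_levels(groups):
    cols = ["__global_model__", *groups]
    return [cols[: i + 1] for i in range(len(cols))]

expand_levels(["a", "b"])
# [['__global_model__'], ['__global_model__', 'a'], ['__global_model__', 'a', 'b']]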

if result.shape != expected_shape:
    raise ValueError(f"shrinkage_function({group_lengths}).shape should be {expected_shape}")

def __predict_estimators(self, X, method_name):
FBruzzesi (Collaborator, Author), Feb 13, 2024:

This is a bit of a headache but allows for a unified approach:

  • preds is a 3d array of shape (n_samples, n_levels, n_classes), where for regression or outlier detection n_classes can be considered to be 1. It gets populated with a prediction for each level.
  • shrinkage is a 2d array of shape (n_samples, n_levels), since the shrinkage factor is already one per level. It holds the shrinkage to use for each sample; since we predict from the outermost to the innermost level, we overwrite when needed.
  • _get_estimator returning the level allows us to select which model shrinkage to use for that particular prediction.
  • Finally we multiply preds by shrinkage and sum over all levels (see the sketch below).
    • If shrinkage is None, the array contains only zeros and ones, so this is equivalent to selecting the model to use.
    • If shrinkage is not None, it is equivalent to averaging with the shrinkage factors.

Remark: last_dim_ix is used for the case in #579; using the estimator's classes lets us index the columns/classes to which the results are assigned.
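
A small sketch of that final combination step, with the shapes described above (random arrays for illustration):

import numpy as np

n_samples, n_levels, n_classes = 4, 3, 2
preds = np.random.rand(n_samples, n_levels, n_classes)
shrinkage = np.random.rand(n_samples, n_levels)
shrinkage = shrinkage / shrinkage.sum(axis=1, keepdims=True)  # each row sums to 1

final = (preds * shrinkage[:, :, None]).sum(axis=1)  # shape (n_samples, n_classes)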

Decision Function

decision_function breaks for the following reasons:

  1. For binary classification, it returns a 1d array.
  2. For multiclass, it returns a 2d array.

Therefore, for the mixed case of group A with classes [0, 1, 2] and group B with classes [0, 3], it has two different output shapes and, most importantly, different meanings.

However:

  • For the "normal" multiclass case this implementation works fine, since we are filling the whole preds array.
  • For binary classification it breaks because preds is initialized with n_classes (=2), yet it is enough to treat this as the regression case.
  • The mixed case is just painful, and actually wrong in the current API as well.
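
To see the shape mismatch concretely (the final comment line is one possible fix, an assumption rather than what this PR does):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(6, 2)
binary = LogisticRegression().fit(X, [0, 1, 0, 1, 0, 1])
multi = LogisticRegression().fit(X, [0, 1, 2, 0, 1, 2])

binary.decision_function(X).shape  # (6,)   -> 1d for binary problems
multi.decision_function(X).shape   # (6, 3) -> 2d for multiclass problems

# A binary score d could be lifted to 2d, e.g. np.c_[-d, d], before concatenating.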

@@ -45,13 +46,15 @@ def _split_groups_and_values(
    _shape_check(X, min_value_cols)

    try:
        lgroups = as_list(groups)
koaning (Owner), Feb 13, 2024:

I guess we should set group_sizes to also be a non-list type in the function definition? Bit of a nit, this one.

Also: maybe groups_list instead of lgroups.

)
return _get_estimator(estimators, grp_values[:-1], grp_names[:-1], return_level - 1, fallback_method)

else: # fallback_method == "global"
koaning (Owner), Feb 13, 2024:

Nit: technically the else isn't needed anymore, because the function would have returned otherwise. It might be nicer to confirm at the start of the function that a correct fallback method was chosen.

Just noticed we check this elsewhere, so it's probably fine not to check here.
